feat: add batch inference API to llama stack inference #1945
Merged
Conversation
ehhuang
approved these changes
Apr 11, 2025
Contributor
ehhuang
left a comment
How much is the speed-up? Just curious.
llama_stack/providers/inline/inference/meta_reference/inference.py
Contributor
Author
I will calculate some aggregate toks/sec values by running a bunch of examples (from evals) sequentially vs. batched.
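For reference, a minimal sketch of what such a comparison could look like; `run_one` and `run_batch` are hypothetical stand-ins for the sequential and batched inference calls, not part of this PR:

```python
# Hypothetical timing sketch (not from the PR): compare aggregate tokens/sec for
# N eval samples run one-by-one vs. submitted as a single batch.
import time

def compare_throughput(samples, run_one, run_batch):
    # run_one(sample)   -> completion token count for a single request
    # run_batch(samples) -> list of completion token counts for one batched request
    t0 = time.perf_counter()
    seq_tokens = sum(run_one(s) for s in samples)
    seq_tps = seq_tokens / (time.perf_counter() - t0)

    t0 = time.perf_counter()
    batch_tokens = sum(run_batch(samples))
    batch_tps = batch_tokens / (time.perf_counter() - t0)

    return {"sequential_toks_per_sec": seq_tps, "batched_toks_per_sec": batch_tps}
```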
Contributor
Author
Corresponding llama-stack-client changes: llamastack/llama-stack-client-python#220
ashwinb
added a commit
to llamastack/llama-stack-client-python
that referenced
this pull request
Apr 12, 2025
Contributor
Author
@ehhuang Here are some numbers for various batch sizes, running 100 samples of the BFCL benchmark against llama-4-scout and llama-3.3-70b.

My conclusion: this batch inference implementation is far from "effective" in terms of accelerating inference substantially. However, it is a good first step. Most of the work in the PR is infrastructure. Furthermore, we can now connect to vLLM's (inline) batch APIs when needed.
MichaelClifford
pushed a commit
to MichaelClifford/llama-stack
that referenced
this pull request
Apr 14, 2025
# What does this PR do?

This PR adds two methods to the Inference API:

- `batch_completion`
- `batch_chat_completion`

The motivation is evaluations targeting a local inference engine (like meta-reference or vllm), where batch APIs provide a substantial amount of acceleration.

Why did I not add this to `Api.batch_inference` though? That just resulted in a _lot_ more book-keeping given the structure of Llama Stack. Had I done that, I would have needed to create a notion of a "batch model" resource, set up routing based on it, etc. This does not sound ideal.

So what's the future of the batch inference API? I am not sure. Maybe we can keep it for true _asynchronous_ execution: you submit requests, and it can return a Job instance, etc.

## Test Plan

Run meta-reference-gpu using:

```bash
export INFERENCE_MODEL=meta-llama/Llama-4-Scout-17B-16E-Instruct
export INFERENCE_CHECKPOINT_DIR=../checkpoints/Llama-4-Scout-17B-16E-Instruct-20250331210000
export MODEL_PARALLEL_SIZE=4
export MAX_BATCH_SIZE=32
export MAX_SEQ_LEN=6144

LLAMA_MODELS_DEBUG=1 llama stack run meta-reference-gpu
```

Then run the batch inference test case.
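As a rough illustration of the new surface (not code from this PR), a batched chat completion from the Python client might look like the sketch below; the method and field names (`batch_chat_completion`, `messages_batch`, `response.batch`) are assumptions based on the description above and the linked client PR, not verified against the merged client.

```python
# Hypothetical usage sketch: send several conversations in one
# batch_chat_completion call instead of looping over chat_completion.
# Method/field names are assumptions, not verified against the merged client.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

conversations = [
    [{"role": "user", "content": "What is the capital of France?"}],
    [{"role": "user", "content": "Name three uses of batch inference."}],
]

response = client.inference.batch_chat_completion(
    model_id="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages_batch=conversations,
)

for item in response.batch:
    print(item.completion_message.content)
```

The intended win is that the provider can pack the conversations into shared forward passes (up to `MAX_BATCH_SIZE`) rather than serving each request one at a time.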